← Back to Contents
Note: This page's design and presentation have been enhanced using Claude (Anthropic's AI assistant) to improve visual quality and educational experience.
Week 2 • Sub-Lesson 1

🏗️ LLM Architecture Deep Dive

Understanding the technical foundations of modern large language models

What We'll Cover

In Week 1, we introduced transformers as "attention machines" and briefly explored how they differ from earlier architectures. This session goes deeper: we'll dive back into the transformer architecture that powers models like GPT, Claude, and Gemini, understand how attention mechanisms actually work, and explore cutting-edge innovations like Mixture of Experts that make modern models more efficient.

By the end of this session you should have a greater appreciation of how LLMs work—and why architectural choices matter for research applications.

Some of this may feel uncomfortably mathematical. If you find yourself getting bogged down, aim to take away the overall gist of what is going on rather than every detail.

🧩 Transformer Architecture Fundamentals

Let's revisit the transformer architecture with more technical depth. While Week 1 gave us the intuition, here we'll understand the actual mechanisms.

This video is in addition to the videos from last week and is also a little technical, but it gives a slightly different take on the 3Blue1Brown explanations.

📹 Take a look at the 3Blue1Brown videos before this one, but here is a lecture series on the transformer architecture.

The Core Components

  • Token embeddings: Converting discrete tokens into continuous vector representations
  • Positional encoding: Injecting information about token position in the sequence
  • Attention layers: The heart of the model—learning which tokens matter
  • Feed-forward networks: Processing attended information within each layer
  • Layer normalization: Stabilizing training across deep networks
  • Residual connections: Allowing gradients to flow through many layers

Decoder-Only Architecture

Modern LLMs (GPT, Claude, LLaMA) use decoder-only architectures rather than the original encoder-decoder design.

  • Autoregressive generation: Predict one token at a time, left to right
  • Causal masking: Each token can only attend to previous tokens
  • Simpler architecture: No cross-attention needed
  • Unified pre-training: Single objective (next-token prediction) for all training
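Causal masking, from the list above, is easy to visualise: it is just a lower-triangular boolean matrix over positions. A minimal sketch:

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

mask = causal_mask(4)
# Row i is the attention pattern for token i: it can see itself and
# everything before it, but nothing after it.
print(mask.astype(int))
```

During training, the entries above the diagonal are set to -infinity in the attention scores, so the softmax assigns them zero weight and each token's prediction never "sees the future".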

💡 Why decoder-only?

Decoder-only models proved more scalable and effective for general-purpose language understanding and generation. The encoder-decoder design is still used for specific tasks like translation.

📐 The Forward Pass: Input to Output

Here's what happens when you send text to an LLM:

  1. Tokenization: Text → token IDs (e.g., "The cat" → [464, 3797])
  2. Embedding lookup: Each token ID → dense vector (e.g., 4096 dimensions)
  3. Positional encoding: Add position information to each token vector
  4. Transformer layers (repeated N times):
    • Multi-head self-attention: tokens attend to previous context
    • Feed-forward network: process attended representations
    • Residual connections + layer norm after each sub-layer
  5. Final layer norm: Normalize output representations
  6. Output projection: Map to vocabulary size (e.g., 50,000 tokens)
  7. Sampling: Choose next token based on probability distribution
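The seven steps above can be sketched end-to-end in a few lines of numpy. This is a toy illustration with random, hypothetical weights and a crude causal "context mix" standing in for real multi-head attention; the point is the data flow, not the numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes -- real models use vocab ~50k-128k, d_model 4096+, dozens of layers
vocab_size, d_model, d_ff, n_layers = 1000, 64, 256, 2

# Hypothetical weights, randomly initialised purely for illustration
W_embed = rng.normal(size=(vocab_size, d_model)) * 0.02
W_ffn1 = rng.normal(size=(d_model, d_ff)) * 0.02
W_ffn2 = rng.normal(size=(d_ff, d_model)) * 0.02
W_out = rng.normal(size=(d_model, vocab_size)) * 0.02

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def transformer_layer(x):
    # A real layer runs causally-masked multi-head attention here; this sketch
    # substitutes a running mean over earlier positions (which is at least
    # causal) to keep the data flow visible without the attention machinery.
    mixed = np.cumsum(x, axis=0) / np.arange(1, len(x) + 1)[:, None]
    x = layer_norm(x + mixed)                    # residual + norm (sub-layer 1)
    ffn = np.maximum(x @ W_ffn1, 0) @ W_ffn2     # two-layer feed-forward network
    return layer_norm(x + ffn)                   # residual + norm (sub-layer 2)

token_ids = np.array([464, 379, 318])                  # step 1: token IDs (given)
x = W_embed[token_ids]                                 # step 2: embedding lookup
x = x + np.sin(np.arange(len(token_ids)))[:, None] * 0.02  # step 3: toy positions
for _ in range(n_layers):                              # step 4: N transformer layers
    x = transformer_layer(x)
x = layer_norm(x)                                      # step 5: final layer norm
logits = x @ W_out                                     # step 6: project to vocab
z = logits[-1] - logits[-1].max()                      # stable softmax
probs = np.exp(z) / np.exp(z).sum()
next_token = int(np.argmax(probs))                     # step 7: greedy "sampling"
```

Real models repeat step 4 many more times (32-100+ layers) and sample from `probs` with temperature rather than always taking the argmax.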

Here is the original paper itself, in case you are feeling particularly brave: "Attention Is All You Need" (Vaswani et al., 2017) 

👁️ Attention Mechanisms in Detail

Attention is the fundamental innovation that makes transformers work. Let's understand the different types and how they've evolved.

🔑 The Attention Intuition

When you read "The animal didn't cross the street because it was too tired," your brain automatically knows "it" refers to "the animal," not "the street." You attend to relevant context.

Self-attention mechanisms let transformers do the same thing: for each token, learn which other tokens in the context are relevant, and weight them accordingly when building representations.

Self-Attention

The core mechanism: each token computes attention scores with every other token in the sequence.

  • Query, Key, Value: Each token produces three vectors through learned projections
  • Attention scores: Dot product of Query with all Keys (how relevant is each token?)
  • Softmax normalization: Convert scores to probability distribution
  • Weighted sum: Combine Values using attention weights

🔍 Scaled Dot-Product

Attention scores are scaled by √d_k (the square root of the key dimension). Without this scaling, dot products grow with dimension, pushing the softmax into saturated regions where gradients become extremely small.
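Putting the four steps together with the scaling just described, here is a minimal numpy sketch of scaled dot-product attention (single head, with causal masking):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=True):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # how relevant is each key to each query?
    if causal:
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)    # block attention to the future
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V, weights              # weighted sum of Values, plus weights

rng = np.random.default_rng(0)
seq_len, d_k = 5, 16
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))
out, weights = scaled_dot_product_attention(Q, K, V)
# Each row of `weights` sums to 1 (a probability distribution over the context),
# and row i carries zero weight on positions after i (causal masking).
```

In a real model, Q, K, and V come from learned projections of the token representations rather than being random.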

Multi-Head Attention

Instead of one attention mechanism, use many in parallel—each learns different patterns.

  • Multiple heads: 32-96 attention heads in modern LLMs
  • Specialized patterns: Each head might learn syntax, semantics, or other relationships
  • Concatenation: Outputs from all heads combined and projected
  • Richer representations: Capture multiple types of dependencies simultaneously
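The split-attend-concatenate pattern described above amounts to a couple of reshapes. A sketch with toy sizes and hypothetical random projection weights (causal masking omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 32, 4
d_head = d_model // n_heads   # each head works in a smaller subspace

x = rng.normal(size=(seq_len, d_model))
# Hypothetical learned projections: one big matrix each, then split per head
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

def split_heads(t):
    # (seq, d_model) -> (n_heads, seq, d_head)
    return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

Q, K, V = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # per-head attention scores
scores = scores - scores.max(-1, keepdims=True)
w = np.exp(scores); w = w / w.sum(-1, keepdims=True)  # per-head softmax
heads = w @ V                                         # (n_heads, seq, d_head)
# Concatenate the heads back together and apply the output projection
out = heads.transpose(1, 0, 2).reshape(seq_len, d_model) @ W_o
```

Because each head attends in its own d_head-dimensional subspace, the heads are free to specialise on different relationships, which is exactly the point of the mechanism.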

Cross-Attention

Used in encoder-decoder models (and multimodal systems): attend to a different sequence.

  • Query from decoder: "What am I trying to generate?"
  • Keys/Values from encoder: "What input information is available?"
  • Translation example: Decoder attends to source language while generating target
  • Vision-language models: Text decoder attends to image features

📹 I don't think that you can do much better than 3Blue1Brown to understand attention

⚡ Modern Attention Variants

Standard multi-head attention is computationally expensive. Modern models use optimized variants:

Variant | Key Idea | Benefit | Used In
Multi-Head Attention (MHA) | Each head has its own Q, K, V projections | Rich representations | Original transformers, GPT-3
Multi-Query Attention (MQA) | Share K, V across heads; unique Q per head | Faster inference, less memory | PaLM, some Llama variants
Grouped-Query Attention (GQA) | Multiple heads share K, V in groups | Balance between MHA and MQA | Llama 2, Mistral, GPT-4 (rumored)

Research implication: GQA has become the dominant choice for new models—it provides most of MHA's quality with much better inference efficiency.
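The inference saving comes mostly from the KV cache, which stores one K and V vector per token, per layer, per KV head. A back-of-envelope comparison using Llama-2-70B-style shapes (80 layers, 64 query heads, head dimension 128, fp16 cache; the 8 KV groups match Llama 2's published config):

```python
# KV-cache size = 2 (K and V) * layers * kv_heads * d_head * seq_len * bytes
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per=2):
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per

seq = 4096
mha = kv_cache_bytes(80, 64, 128, seq)   # MHA: every query head has its own K/V
gqa = kv_cache_bytes(80, 8, 128, seq)    # GQA: 8 KV groups shared by 64 query heads
mqa = kv_cache_bytes(80, 1, 128, seq)    # MQA: a single K/V shared by all heads
print(f"MHA: {mha/1e9:.1f} GB, GQA: {gqa/1e9:.1f} GB, MQA: {mqa/1e9:.2f} GB")
# -> MHA: 10.7 GB, GQA: 1.3 GB, MQA: 0.17 GB  (per 4096-token sequence)
```

An 8x reduction in cache memory per sequence is why GQA lets servers batch many more concurrent requests, at almost no quality cost.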

📍 Positional Encoding: Teaching Position

Transformers process all tokens in parallel (unlike RNNs which are sequential). But word order matters! Positional encoding solves this problem.

Absolute Positional Encoding

Original approach: add a position-specific vector to each token embedding.

  • Sinusoidal encoding: Original transformer used sin/cos functions of different frequencies
  • Learned positions: Some models learn position embeddings during training (like GPT)
  • Fixed context: Limited to maximum sequence length seen during training
  • Issue: Doesn't extrapolate well to longer sequences

Relative Positional Encoding

Modern approach: encode the distance between tokens rather than absolute position.

  • RoPE (Rotary Position Embedding): Rotate Q and K vectors based on position—used in LLaMA, Mistral, GPT-NeoX
  • ALiBi (Attention with Linear Biases): Add bias to attention scores based on distance—used in BLOOM, MPT
  • Length extrapolation: Can handle sequences longer than training context
  • Better generalization: Understands "distance" concept rather than memorizing positions

💡 Why RoPE Dominates

RoPE (Rotary Position Embedding) has become the standard for new LLMs because it:

  • Encodes relative positions naturally through rotation in complex space
  • Allows models to extrapolate to longer contexts than seen in training
  • Maintains computational efficiency (applied during Q/K computation)
  • Empirically outperforms alternatives on long-context tasks
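The "rotation in complex space" idea is concrete enough to verify in a few lines: RoPE rotates each consecutive pair of dimensions by an angle proportional to the token's position, with a different frequency per pair. Because rotations compose, the dot product between a rotated query and key depends only on the *difference* of their positions. A minimal sketch:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive dimension pairs of vector x by position-dependent angles."""
    d = x.shape[-1]
    theta = base ** (-2 * np.arange(d // 2) / d)  # one frequency per dimension pair
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[0::2], x[1::2]                     # split into (even, odd) pairs
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin               # standard 2-D rotation per pair
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=16), rng.normal(size=16)
# Key property: the attention score depends only on the *relative* offset.
s1 = rope(q, 5) @ rope(k, 3)       # positions (5, 3): offset 2
s2 = rope(q, 105) @ rope(k, 103)   # positions (105, 103): same offset 2
assert np.isclose(s1, s2)          # identical score -> relative position encoding
```

Rotations also preserve vector norms, so RoPE changes *where* a query points without changing its magnitude, which is part of why it is so well behaved.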

📄 Again, this is a little mathsy, but see if you can work through this guide, perhaps with ChatGPT's help, to understand positional encodings

Positional Embeddings in Transformers: A Math Guide to RoPE & ALiBi

🔀 Modern Architectural Innovations

The frontier of LLM architecture isn't just about making models bigger—it's about making them smarter. Here are the key innovations driving 2024-2026 models.

💡 The Efficiency Revolution

Modern LLM research focuses on parameter efficiency: getting better performance without proportionally increasing compute costs. The key insight: not every parameter needs to activate for every input.

This shift—from "bigger is better" to "smarter is better"—is driven by Mixture of Experts architectures, attention optimizations, and clever training techniques.

🎯 Mixture of Experts (MoE)

MoE is perhaps the most important architectural innovation in modern LLMs. Instead of using all parameters for every token, route each token to a subset of specialized "expert" networks.

How MoE Works:
  1. Expert networks: Instead of one feed-forward network per layer, have 8-64 expert FFNs
  2. Router network: Small learned network decides which experts process each token
  3. Sparse activation: Only top-K experts (typically 2-8) activate per token
  4. Load balancing: Ensure tokens distribute roughly evenly across experts
  5. Combination: Outputs from active experts weighted and summed
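The five steps above can be sketched for a single token in a few lines. Everything here is a toy, with hypothetical random weights; real MoE layers also add load-balancing losses and run experts in parallel across devices.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 32, 8, 2

# Hypothetical weights: a tiny router plus one two-layer FFN per expert (step 1)
W_router = rng.normal(size=(d_model, n_experts)) * 0.1
experts = [(rng.normal(size=(d_model, 64)) * 0.1,
            rng.normal(size=(64, d_model)) * 0.1) for _ in range(n_experts)]

def moe_layer(x):
    """Route token vector x to its top-k experts and blend their outputs."""
    logits = x @ W_router                         # step 2: router scores each expert
    chosen = np.argsort(logits)[-top_k:]          # step 3: only top-k experts activate
    gate = np.exp(logits[chosen])
    gate = gate / gate.sum()                      # normalised mixture weights
    out = np.zeros_like(x)
    for g, (W1, W2) in zip(gate, (experts[i] for i in chosen)):
        out += g * (np.maximum(x @ W1, 0) @ W2)   # step 5: weighted sum of experts
    return out, chosen

token = rng.normal(size=d_model)
y, used = moe_layer(token)
# Only top_k of the n_experts FFNs ran for this token: the layer holds 8 experts'
# worth of parameters but pays compute for just 2.
```

Step 4 (load balancing) is the part this sketch omits: in training, an auxiliary loss pushes the router to spread tokens across experts so none sit idle.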

MoE Benefits

  • Massive parameter count: 100B+ total parameters with only 10-20B active per token
  • Inference efficiency: Computational cost based on active parameters, not total
  • Specialization: Different experts can learn different domains/patterns
  • Scaling: Add more experts without proportional compute increase

MoE Challenges

  • Training complexity: Load balancing is tricky—some experts might be underutilized
  • Memory requirements: All experts must fit in GPU memory even if only few are active
  • Communication overhead: Routing adds latency in distributed systems
  • Instability: Careful tuning needed to prevent expert collapse

📊 MoE in Production Models

Mixtral 8x7B: 8 experts, 2 active per token → 47B total params, 13B active → performs like 47B dense model at cost of 13B

DeepSeek-V3: 256 experts, 8 active per token → 671B total params, 37B active → competitive with GPT-4 at fraction of inference cost

GPT-4 (rumored): Speculated to use MoE with 8-16 experts, explaining its size vs. inference speed

📹 Mixture of Experts explanation

📄 DeepSeek-V3 Github repo

Here is an open-weights Mixture of Experts model that you can, in theory, download and run (though I wouldn't do this on your laptop).

📏 Understanding Model Scale

When we say "GPT-4 has 1.76 trillion parameters" or "LLaMA 3 70B," what do these numbers actually mean? And is bigger always better?

Parameter Count Breakdown

Parameters are the learned weights in the neural network. They're distributed across:

  • Embedding layers: Token + position embeddings (vocab_size × embedding_dim)
  • Attention layers: Q, K, V, output projections for each head, each layer
  • Feed-forward networks: Two linear layers per transformer layer (typically 4× hidden size)
  • Layer norms: Small contribution (scale + shift per layer)

🔢 Quick Math

A 7B-parameter model such as LLaMA has 32 layers, a hidden size of 4096, and 32 attention heads. Each layer contributes roughly 4 × 4096² attention weights plus about 3 × 4096 × 11008 feed-forward weights; multiplied across 32 layers and adding embeddings, that comes to ≈ 7 billion parameters.
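You can check a count like this yourself. A rough tally for a LLaMA-7B-style model, using its published configuration (32 layers, hidden size 4096, FFN size 11008, vocabulary 32000, SwiGLU feed-forward):

```python
# Rough parameter count for a LLaMA-7B-style transformer
d, layers, d_ffn, vocab = 4096, 32, 11008, 32000

attn = 4 * d * d             # Q, K, V and output projections
ffn = 3 * d * d_ffn          # SwiGLU uses three matrices (gate, up, down)
per_layer = attn + ffn
embeddings = 2 * vocab * d   # input embedding + (untied) output head

total = layers * per_layer + embeddings
print(f"{total/1e9:.2f}B parameters")   # -> 6.74B, matching the "7B" label
```

Layer norms are omitted here; at two small vectors per layer they contribute well under a million parameters.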

Parameters vs. Active Parameters

For MoE models, these are different concepts:

  • Total parameters: All weights in all experts (what's reported as "model size")
  • Active parameters: Weights used for any single forward pass
  • Example: Mixtral 8x7B has 47B total params but only 13B active per token
  • Inference cost: Determined by active parameters, not total
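The Mixtral numbers above can be reproduced with the same kind of tally, using Mixtral-8x7B's published configuration (32 layers, hidden 4096, FFN 14336, 8 experts with top-2 routing, 32 query heads sharing 8 KV heads, vocabulary 32000):

```python
# Total vs. active parameters for a Mixtral-8x7B-style MoE model
d, layers, d_ffn, vocab = 4096, 32, 14336, 32000
n_experts, active_experts, n_heads, n_kv_heads, d_head = 8, 2, 32, 8, 128

attn = d * (2 * n_heads * d_head + 2 * n_kv_heads * d_head)  # Q,O full; K,V grouped
expert = 3 * d * d_ffn          # one SwiGLU FFN per expert
embeddings = 2 * vocab * d

total  = layers * (attn + n_experts * expert) + embeddings
active = layers * (attn + active_experts * expert) + embeddings
print(f"total = {total/1e9:.0f}B, active = {active/1e9:.0f}B")   # -> 47B and 13B
```

Note that the attention weights and embeddings are always active; only the expert FFNs are sparsely used, which is why active parameters are more than 2/8 of the total.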

Dense vs. Sparse Architectures

  • Dense models: All parameters active for every input (GPT-3, Claude, LLaMA 2)
  • Sparse models (MoE): Subset of parameters active per input (Mixtral, DeepSeek-V3, Switch Transformer)
  • Trade-off: Sparse models offer better parameter efficiency but increased training complexity
  • Trend: Major labs moving toward sparse architectures for flagship models

🎯 When Bigger ≠ Better

The relationship between model size and performance is nuanced:

Scenario | Smaller Model Wins | Larger Model Wins
Latency-sensitive applications | ✓ Faster inference, lower latency | ✗ Slower, requires more compute
Resource-constrained deployment | ✓ Runs on smaller GPUs, edge devices | ✗ Needs high-end infrastructure
Narrow domain tasks | ✓ Can be fine-tuned effectively | Diminishing returns
Complex reasoning | Limited capability | ✓ Better at multi-step problems
Rare/specialized knowledge | Likely to hallucinate | ✓ More knowledge encoded in parameters
Few-shot learning | Requires more examples | ✓ Better in-context learning

Research principle: Match model size to your task. A well-trained 7B model often outperforms a poorly-prompted 70B model. And for many research tasks (data analysis, writing assistance, literature review), mid-size models are sufficient.

🔮 The Current Frontier (Feb 2026)

Dense models: Have been largely superseded by MoE models at the frontier. However, GPT-4o seems to have been a dense model with around 1T parameters

MoE models: Many 200B-700B total parameters with 30B-50B active (DeepSeek-V3, Mixtral variants). Claude 4.6 may be an MoE model with around 700B parameters. ChatGPT 5.2 may have over 10T parameters, with fewer active parameters.

Small models: 7B-13B models (LLaMA 3, Mistral 7B) remain popular for local deployment, fine-tuning, and research experimentation

Trend: Architectural efficiency improvements (MoE, GQA, better training) matter more than raw parameter count

📄 Scaling Laws for LLMs

Understanding the current state of LLM scaling and the future of AI research

📚 Summary & Key Takeaways

You now understand the technical architecture of modern LLMs:

  • Transformer fundamentals: Decoder-only architecture, layer structure, residual connections
  • Attention mechanisms: Self-attention, multi-head variants (MHA/MQA/GQA), how tokens learn relevance
  • Positional encoding: RoPE and ALiBi enable models to understand sequence order and extrapolate length
  • Mixture of Experts: Sparse activation allows massive models with efficient inference
  • Scale considerations: Bigger isn't always better—match model to task, consider active vs. total parameters

Next session (Week 2.2): We'll explore how these architectures are actually trained—the pre-training process, optimization techniques, and the computational resources required to create LLMs from scratch.